15 research outputs found

    PicShark: mitigating metadata scarcity through large-scale P2P collaboration

    With the commoditization of digital devices, personal information and media sharing is becoming a key application on the pervasive Web. In such a context, data annotation rather than data production is the main bottleneck. Metadata scarcity represents a major obstacle preventing efficient information processing in large and heterogeneous communities. However, social communities also open the door to new possibilities for addressing local metadata scarcity by taking advantage of global collections of resources. We propose to tackle the lack of metadata in large-scale distributed systems through a collaborative process leveraging both content and metadata. We develop a community-based and self-organizing system called PicShark in which information entropy, in terms of missing metadata, is gradually alleviated through decentralized instance and schema matching. Our approach focuses on semi-structured metadata and confines computationally expensive operations to the edge of the network, while keeping distributed operations as simple as possible to ensure scalability. PicShark builds on structured Peer-to-Peer networks for distributed look-up operations, but extends the application of self-organization principles to the propagation of metadata and the creation of schema mappings. We demonstrate the practical applicability of our method in an image sharing scenario and provide experimental evidence illustrating the validity of our approach.

    PicShark: Mitigating Metadata Scarcity Through Large-Scale P2P Collaboration

    With the commoditization of digital devices, personal information and media sharing is becoming a key application on the pervasive Web. In such a context, data annotation rather than data production is the main bottleneck. Metadata scarcity represents a major obstacle preventing efficient information processing in large and heterogeneous communities. However, social communities also open the door to new possibilities for addressing local metadata scarcity by taking advantage of global collections of resources. We propose to tackle the lack of metadata in large-scale distributed systems through a collaborative process leveraging both content and metadata. We develop a community-based and self-organizing system called PicShark in which information entropy, in terms of missing metadata, is gradually alleviated through decentralized instance and schema matching. Our approach focuses on semi-structured metadata and confines computationally expensive operations to the edge of the network, while keeping distributed operations as simple as possible to ensure scalability. PicShark builds on structured Peer-to-Peer networks for distributed look-up operations, but extends the application of self-organization principles to the propagation of metadata and the creation of schema mappings. We demonstrate the practical applicability of our method in an image sharing scenario and provide experimental evidence illustrating the validity of our approach.

    To Tag or Not to Tag? Harvesting Adjacent Metadata in Large-Scale Tagging Systems

    We present HAMLET, a suite of principles, scoring models and algorithms to automatically propagate metadata along edges in a document neighborhood. As a showcase scenario we consider tag prediction in community-based Web 2.0 tagging applications. Experiments using real-world data demonstrate the viability of our approach in large-scale environments where tags are scarce. To the best of our knowledge, HAMLET is the first system to promote an efficient and precise reuse of shared metadata in highly dynamic, large-scale Web 2.0 tagging systems.

    Neighborhood-based Tag Prediction

    We consider the problem of tag prediction in collaborative tagging systems where users share and annotate resources on the Web. We put forward HAMLET, a novel approach to automatically propagate tags along the edges of a graph which relates similar documents. We identify the core principles underlying tag propagation, for which we derive suitable scoring models combined in one overall ranking formula. Leveraging these scores, we present an efficient top-k tag selection algorithm that infers additional tags by carefully inspecting neighbors in the document graph. Experiments using real-world data demonstrate the viability of our approach in large-scale environments where tags are scarce.
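    The idea of propagating tags along the edges of a document graph and keeping the top-k candidates can be illustrated with a minimal sketch. This is not HAMLET's scoring model or ranking formula, which the abstract does not spell out; it assumes a hypothetical score that simply sums the similarity weights of the neighbors carrying a tag. The names `predict_tags`, `neighbors` and `tags`, and all of the toy data, are illustrative assumptions.

```python
import heapq
from collections import defaultdict


def predict_tags(doc, neighbors, tags, k=3):
    """Suggest up to k new tags for `doc` from its graph neighbors.

    neighbors: dict mapping a document to {neighbor: edge weight}
    tags:      dict mapping a document to the set of tags it already has
    Assumed score (not from the paper): sum of the edge weights of all
    neighbors that carry the tag.
    """
    own = tags.get(doc, set())
    scores = defaultdict(float)
    for other, weight in neighbors.get(doc, {}).items():
        for tag in tags.get(other, set()):
            if tag not in own:  # only propose tags the document lacks
                scores[tag] += weight
    # Highest score first; ties are broken alphabetically for determinism.
    return heapq.nsmallest(k, scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy document graph with made-up similarity weights.
neighbors = {"d1": {"d2": 0.75, "d3": 0.5, "d4": 0.25}}
tags = {
    "d1": {"p2p"},
    "d2": {"p2p", "metadata"},
    "d3": {"metadata", "tagging"},
    "d4": {"web"},
}
suggested = predict_tags("d1", neighbors, tags, k=2)
print(suggested)
```

    In this toy graph, "metadata" is carried by the two closest neighbors and therefore ranks first. A realistic scoring model would likely also account for tag frequency and neighbor reliability.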

    From Web 1.0 to Web 2.0 and back - How did your Grandma use to tag?

    We consider the applicability of terms extracted from anchortext as a source of Web page descriptions in the form of tags. With a relatively simple and easy-to-use method, we show that anchortext significantly overlaps with tags obtained from the popular tagging portal del.icio.us. Considering the size and diversity of the user community potentially involved in social tagging, this observation is rather surprising. Furthermore, an evaluation using human-created relevance assessments shows the general suitability of anchortext-based tag generation in terms of user-perceived precision. Awareness of this easy-to-obtain source of tags could trigger the rise of new tagging portals bootstrapped by this automatic process, or it could be applied in existing portals to increase the number of tags per page merely by looking at anchortext that exists anyway.
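    A minimal sketch of how the overlap between anchortext terms and user tags might be measured is given below. The abstract does not describe the extraction method, so the tokenization, the small stopword list, and the overlap measure (the fraction of user tags that also appear as anchortext terms) are all assumptions, as are the function names and the sample data.

```python
import re

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = frozenset({"the", "a", "of", "to", "and", "click", "here"})


def anchor_terms(anchor_texts):
    """Collect lowercase alphanumeric terms from anchortexts, minus stopwords."""
    terms = set()
    for text in anchor_texts:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token not in STOPWORDS:
                terms.add(token)
    return terms


def tag_overlap(anchor_texts, user_tags):
    """Fraction of user tags that also occur as anchortext terms."""
    tags = {t.lower() for t in user_tags}
    if not tags:
        return 0.0
    return len(anchor_terms(anchor_texts) & tags) / len(tags)


# Made-up anchortexts pointing at one page, and made-up tags for that page.
anchors = ["Python tutorial", "click here", "the Python docs"]
page_tags = ["python", "programming", "tutorial", "docs"]
print(tag_overlap(anchors, page_tags))
```

    Here three of the four tags ("python", "tutorial", "docs") also occur in the anchortext, so the overlap is 0.75.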

    Leveraging User-Generated Content for Information Discovery on the Web

    No full text
    The large-scale adoption of the Web 2.0 paradigm has revolutionized the way we interact with the Web today. End-users, so far mainly passive consumers of information, are now becoming active information producers, creating, uploading, and commenting on all types of digital content. As a consequence, the Web has evolved from a collection of static HTML pages to a highly interactive system, where information is being published and consumed at high rates. This has tremendously increased the amount of data available on the Web today, which brings about new challenges in terms of information management. At the same time, the increased user participation represents a new and extremely valuable source of data. While interacting with different Web 2.0 portals, users freely provide all types of information, such as annotations describing the shared resources, friendship links connecting similar users, etc., which can be exploited in order to improve the methods designed to manage online content. Particularly interesting examples of user-generated data are the so-called social annotations that users attach to resources in the context of collaborative tagging systems. This kind of meta-information opens up new opportunities for improved content search, new means to organize personal data, and ways of mining user profiles based on their annotations. Virtual friendship connections between users, as we can observe in social networks, are another rich source of information, as they often group users with similar interests together, provide a means to study information diffusion, and open the way to enhanced expert finding. In this thesis, we leverage information extracted from user-generated data in order to solve current information management problems, such as data retrieval, mining and integration.
We explore different scenarios where online content is enriched with user-defined meta-information, and we identify specific problems, which we solve by leveraging this information. We start by addressing the problem of context-based information discovery in collaborative tagging systems, where we take advantage of user-defined entity graphs, such as a citation graph of publications or a friendship graph of users. In this setting, effective search solutions require a certain amount of annotations; however, content is often poorly annotated. We therefore propose a method that exploits the context-related information embedded in the graph structure in order to automatically infer new annotations. Our approach propagates tags along the edges of the graph, based on the assumption that the neighborhood of a resource holds additional information about the resource itself. We see a similar graph structure in social communities, where users are connected via friendship links and where the neighborhood of a user reflects her community of interest. We adopt the hypothesis that users mainly annotate resources of interest to them and interpret the annotations (i.e., tags) as an interest profile. Hence, we propose a novel framework for tag-based community detection in collaborative tagging systems that considers both the tagging behavior of users and their friendship graph. Based on a set of tags, our method returns a closely connected community of users whose tags jointly cover the initial set. In order to further investigate the issue of generating user profiles based on their annotations, we switch our attention from the open Web to an enterprise setting. Social software, such as collaborative tagging systems, has also been adopted in the enterprise space, where it opens up new opportunities to address the problem of expert search.
We take advantage of the data extracted from two enterprise-internal portals and explore correlations between the users' tagging behavior and their corresponding areas of expertise. Based on these correlations, we devise a method that derives expertise profiles for users. Finally, we investigate how user-generated meta-information can be exploited in the domain of structured data on the Web, i.e., data that is organized according to the relational model and complies with application-specific schemas. In order to enable transparent search solutions on such data, schema heterogeneity needs to be overcome by means of data integration techniques. We explore how such techniques can benefit from user-generated meta-information in the form of links between similar entries in different databases. Based on these links, we devise a method to create mappings between the elements of different schemas from a real-world collection of online bioinformatic databases.
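    The last step, deriving schema mappings from links between similar entries in different databases, can be sketched as a simple voting scheme. This is an assumed, simplified stand-in for the method the thesis refers to, whose details the abstract does not give: for every linked pair of records, attribute pairs with equal normalized values receive a vote, and each source attribute is mapped to the target attribute with the most votes. The function name, the `min_support` threshold, and the toy records are hypothetical.

```python
from collections import Counter


def propose_mappings(links, min_support=2):
    """Propose attribute mappings from linked record pairs.

    links: iterable of (record_a, record_b) dicts that users have linked
           as describing the same entity.
    Returns {attribute in schema A: attribute in schema B}.
    """
    votes = Counter()
    for rec_a, rec_b in links:
        for attr_a, val_a in rec_a.items():
            for attr_b, val_b in rec_b.items():
                # Compare values case-insensitively, ignoring outer whitespace.
                if str(val_a).strip().lower() == str(val_b).strip().lower():
                    votes[(attr_a, attr_b)] += 1

    best = {}
    for (attr_a, attr_b), count in votes.items():
        # Keep only well-supported pairs; on ties, the first one seen wins.
        if count >= min_support and count > best.get(attr_a, (None, 0))[1]:
            best[attr_a] = (attr_b, count)
    return {attr_a: attr_b for attr_a, (attr_b, _) in best.items()}


# Toy linked records from two made-up schemas.
links = [
    ({"gene": "G1", "organism": "human"},
     {"symbol": "g1", "species": "Human", "source": "db-b"}),
    ({"gene": "G2", "organism": "mouse"},
     {"symbol": "G2", "species": "mouse", "source": "db-b"}),
]
mappings = propose_mappings(links)
print(mappings)
```

    With these two links the sketch maps `gene` to `symbol` and `organism` to `species`. Raising `min_support` trades recall for fewer spurious mappings caused by coincidental value matches.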